Machine Learning, Copulas and Synthetic Data

Synthetic data is one of those ideas that sounds like it shouldn’t work — if the data is fake, how can a model trained on it generalize to real data? The answer is that “fake” here doesn’t mean arbitrary. Synthetic data generated properly preserves the statistical structure of the original: the marginal distributions of each variable, and the dependence structure between them. If you get that right, the downstream model often can’t tell the difference.

Why bother? Two reasons come up constantly: small samples and privacy constraints. When you don’t have enough data to train a reliable model, generating additional synthetic observations from the learned distribution can help. When your data contains sensitive information that limits how freely it can be shared or used, synthetic data gives you a safe stand-in.

Copulas are the right tool for this. A copula separates two things that are usually bundled together in a multivariate distribution: the marginal distribution of each individual variable, and the dependence structure among them. This separation matters because real data is rarely multivariate normal — the marginals may follow different parametric families, and the correlations in the tails may differ from correlations in the center. Copulas handle both.
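
The mechanics can be sketched in a few lines with plain numpy and scipy (an illustration of the idea, separate from the copulas library used later): sample correlated normals, map them to uniforms with the normal CDF, then push the uniforms through whatever inverse CDFs you want as marginals. The specific marginals below (a gamma and a normal) are arbitrary choices for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]

# Step 1: correlated standard normals -> correlated uniforms via the normal CDF
z = rng.multivariate_normal([0, 0], cov, size=10_000)
u = stats.norm.cdf(z)                          # each column ~ Uniform(0, 1)

# Step 2: push the uniforms through arbitrary inverse CDFs (the marginals)
x = stats.gamma(a=2.0).ppf(u[:, 0])            # right-skewed marginal
y = stats.norm(loc=15, scale=3).ppf(u[:, 1])   # symmetric marginal

# The dependence (rank correlation) survives the marginal transforms
tau, _ = stats.kendalltau(x, y)
```

The key point: the dependence was fixed in step 1 and the marginals in step 2, completely independently of each other.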

To give a concrete example, let’s use a few variables from the classic Auto MPG dataset, which ships with seaborn.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import ParametricType, Univariate

df = sns.load_dataset("mpg")
df = df.drop(columns=['origin', 'name'])
df = df.dropna()
df.columns

df = df[['horsepower', 'weight', 'acceleration', 'mpg']]
df.describe()
horsepower weight acceleration mpg
count 392.000000 392.000000 392.000000 392.000000
mean 104.469388 2977.584184 15.541327 23.445918
std 38.491160 849.402560 2.758864 7.805007
min 46.000000 1613.000000 8.000000 9.000000
25% 75.000000 2225.250000 13.775000 17.000000
50% 93.500000 2803.500000 15.500000 22.750000
75% 126.000000 3614.750000 17.025000 29.000000
max 230.000000 5140.000000 24.800000 46.600000

Let’s look at the distribution of each variable, the pairwise scatter plots, and the correlations.

def corrdot(*args, **kwargs):
    # Draw a single dot whose size and color encode the Pearson correlation,
    # with the coefficient printed on top of it.
    corr_r = args[0].corr(args[1], method='pearson')
    corr_text = f"{corr_r:2.2f}".replace("0.", ".")
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 1000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
               vmin=-1, vmax=1, transform=ax.transAxes)
    font_size = abs(corr_r) * 10 + 5
    ax.annotate(corr_text, [.5, .5], xycoords="axes fraction",
                ha='center', va='center', fontsize=font_size)

sns.set(style='white', font_scale=1)
g = sns.PairGrid(df[['horsepower', 'weight','acceleration']], aspect=1, diag_sharey=False)
g.map_lower(sns.regplot, lowess=True, ci=None, line_kws={'color': 'black'})
g.map_diag(sns.kdeplot, color='black')
g.map_upper(corrdot)
plt.show()

A few things stand out. Some pairs of variables are strongly correlated. The distributions differ in shape: acceleration is close to normal, but horsepower and weight have heavier right tails. A Gaussian copula with normal marginals (i.e., a multivariate normal) would miss that — which is why we want the copula to estimate the marginals separately.
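
As a rough numeric check of that claim, Pearson’s second skewness coefficient, 3 × (mean − median) / std, can be computed directly from the describe() table above (values copied from the table; the median is the 50% row):

```python
# Back-of-envelope skewness from the printed summary statistics:
#                (mean,        median, std)
stats_table = {
    "horsepower":   (104.469388,   93.5, 38.491160),
    "weight":       (2977.584184, 2803.5, 849.402560),
    "acceleration": (15.541327,    15.5,  2.758864),
}
skewness = {
    name: 3 * (mean - median) / std
    for name, (mean, median, std) in stats_table.items()
}
```

Horsepower comes out strongly right-skewed (≈0.86), weight moderately (≈0.62), and acceleration nearly symmetric (≈0.04), consistent with the KDE shapes on the diagonal of the pair plot.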

Before generating synthetic data, let’s fit a simple linear regression on the original data as a baseline.

# pop() removes mpg from df in place, so X keeps only the three predictors
y = df.pop('mpg')
X = df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
0.6501833421053667

Now we generate a synthetic training set using a Gaussian copula, with the marginals estimated separately as parametric distributions (not KDE). The copula model selects the best-fitting parametric family for each variable, then learns the correlation structure among them.

# Select the best PARAMETRIC univariate (no KDE)
univariate = Univariate(parametric=ParametricType.PARAMETRIC)


def create_synthetic(X, y):
    """
    This function combines X and y into a single dataset D, models it
    using a Gaussian copula, and generates a synthetic dataset S. It
    returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)

    distribs = GaussianMultivariate(distribution=univariate)
    distribs.fit(dataset)

    synthetic = distribs.sample(len(dataset))

    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y, distribs

X_synthetic, y_synthetic, dist = create_synthetic(X_train, y_train)

Let’s inspect what the copula model learned — the fitted distribution for each variable:

parameters = dist.to_dict()
parameters['univariates']
[{'a': np.float64(2.505456580509649),
  'loc': np.float64(44.28600428264269),
  'scale': np.float64(24.186079118156652),
  'type': 'copulas.univariate.gamma.GammaUnivariate'},
 {'loc': np.float64(1604.636580448248),
  'scale': np.float64(3779.05677499984),
  'a': np.float64(1.470861609942495),
  'b': np.float64(2.520267107805836),
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': np.float64(1.0632403337723968),
  'scale': np.float64(73.46060125357005),
  'a': np.float64(20.470245095113114),
  'b': np.float64(83.54399672283503),
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': np.float64(9.84592394053172),
  'scale': np.float64(40.487634662917245),
  'a': np.float64(1.6832256330594189),
  'b': np.float64(3.231729039577079),
  'type': 'copulas.univariate.beta.BetaUnivariate'}]
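
The first entry is the horsepower marginal: a gamma distribution. As a sanity check, we can rebuild it in scipy from the printed parameters and compare its implied mean and standard deviation to the describe() table (mean ≈ 104.5, std ≈ 38.5; the copula was fitted on the training split, so a small gap is expected):

```python
from scipy import stats

# Gamma parameters copied from the fitted-marginal output above
hp = stats.gamma(a=2.505456580509649,
                 loc=44.28600428264269,
                 scale=24.186079118156652)

fitted_mean = hp.mean()   # loc + a * scale
fitted_std = hp.std()     # sqrt(a) * scale
```

Both land within a fraction of a unit of the data’s summary statistics, which is a quick way to confirm the marginal fit is sensible before trusting the samples.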

And the learned correlation matrix that defines the joint structure:

parameters['correlation']
[[1.0, 0.8473079530880548, -0.720090874761445, -0.8315209336313288],
 [0.8473079530880548, 1.0, -0.42047354869738274, -0.8311150310370441],
 [-0.720090874761445, -0.42047354869738274, 1.0, 0.4357973452281482],
 [-0.8315209336313288, -0.8311150310370441, 0.4357973452281482, 1.0]]
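
Note that this matrix lives on the Gaussian (normal-score) scale, not the raw-data scale. For a Gaussian copula it converts to Kendall’s tau via τ = (2/π)·arcsin(ρ), which is handy when comparing against rank statistics of the original data (matrix values copied from the output above):

```python
import numpy as np

# Correlation matrix learned by the Gaussian copula (from the output above)
corr = np.array([
    [ 1.0,         0.84730795, -0.72009087, -0.83152093],
    [ 0.84730795,  1.0,        -0.42047355, -0.83111503],
    [-0.72009087, -0.42047355,  1.0,         0.43579735],
    [-0.83152093, -0.83111503,  0.43579735,  1.0],
])

# Implied Kendall's tau for a Gaussian copula: tau = (2 / pi) * arcsin(rho)
tau = (2 / np.pi) * np.arcsin(corr)
```

For example, the horsepower–weight entry of 0.847 corresponds to a Kendall’s tau of roughly 0.64.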

Now compare the synthetic data against the original — same summary statistics, same pair plots.

syntDF = pd.DataFrame(
    np.concatenate([X_synthetic, np.expand_dims(y_synthetic, 1)], axis=1),
    columns=['horsepower', 'weight', 'acceleration', 'mpg'])

syntDF.describe()
horsepower weight acceleration mpg
count 274.000000 274.000000 274.000000 274.000000
mean 106.741787 3001.165321 15.282710 24.006457
std 38.396280 834.124368 2.897945 8.008243
min 50.062733 1626.735130 9.532103 10.022592
25% 78.610155 2274.718738 13.244567 17.709268
50% 97.469412 2887.631374 14.938047 23.348269
75% 127.881791 3659.378723 17.208566 28.896533
max 261.357416 5142.185723 26.928153 44.491918
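
The 274 rows match the 70% training split the copula was fitted on. To put a number on “close”, we can compute the relative differences in means between the two describe() tables (values copied from the tables above):

```python
# Means from the original describe() table (392 rows)
orig_mean = {"horsepower": 104.469388, "weight": 2977.584184,
             "acceleration": 15.541327, "mpg": 23.445918}
# Means from the synthetic describe() table (274 rows)
synth_mean = {"horsepower": 106.741787, "weight": 3001.165321,
              "acceleration": 15.282710, "mpg": 24.006457}

rel_diff = {k: abs(synth_mean[k] - orig_mean[k]) / orig_mean[k]
            for k in orig_mean}
```

All four means agree to within about 2.5%.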

The descriptive statistics are close. The pair plot shows the shapes are similar but not identical — the tails in particular may differ somewhat, which is expected given the Gaussian copula assumption on the dependence structure. If the tail behavior mattered critically here, a Clayton or Gumbel copula would be more appropriate.
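
The Clayton point can be made concrete with a small simulation (an illustration, independent of the fit above). The Gaussian copula has zero tail dependence, so the joint tail probability P(U < q, V < q)/q shrinks as q → 0, while a Clayton copula with parameter θ retains lower-tail dependence λ_L = 2^(−1/θ). Here both copulas are calibrated to the same Kendall’s τ = 0.5 (ρ = sin(πτ/2) ≈ 0.707 for Gaussian, θ = 2τ/(1 − τ) = 2 for Clayton):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, q = 200_000, 0.01

# Gaussian copula sample: correlated normals mapped to uniforms
rho = np.sin(np.pi * 0.5 / 2)
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
gu, gv = stats.norm.cdf(z).T

# Clayton copula sample via its conditional-inverse method (theta = 2)
theta = 2.0
u = rng.uniform(size=n)
w = rng.uniform(size=n)
v = (u ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)

# Empirical joint lower-tail probability, normalised by q
gauss_ratio = np.mean((gu < q) & (gv < q)) / q
clayton_ratio = np.mean((u < q) & (v < q)) / q
```

At q = 0.01 the Clayton ratio stays near λ_L = 2^(−1/2) ≈ 0.71 while the Gaussian ratio is far smaller, and the gap widens as q shrinks — exactly the behavior that matters when joint extremes drive the application.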

Finally, re-fit the same linear regression on the synthetic training data and evaluate on the original test set:

model = LinearRegression()
model.fit(X_synthetic, y_synthetic)

print(model.score(X_test, y_test))
0.6343060805430143

The R² is very close to the one from the original data. Not identical — there’s some information loss in the synthetic generation process — but close enough to demonstrate the point: the copula has preserved the statistical structure that the regression model actually needs to make good predictions.

This matters in practice. Synthetic data generated this way isn’t just noise with the right summary statistics; it captures the correlations and distributional shapes that determine model behavior. For augmentation, privacy-preserving data sharing, or testing model robustness, that’s exactly what you need.